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—Her  consider^generalized  linear  models  in  which  the  linear  predictor  is  of 
additive  semi-parametric  form,  linear  in  most  of  the  explanatory  variables  but 
with  an  arbitrary  functional  dependence  on  the  remainder.  Estimation  of  the 
parameters  and  the  non-parametric  curve  in  the  model  is  approached  by 
maximizing  a  penalized  likelihood.  Two  explicit  iterative  algorithms  are 
presented.  The  first,  which  operates  in  0(n)  time  per  iteration,  applies 
where  there  is  just  one  variable  entering  the  model  in  a  non-parametric 
fashion,  and  an  integrated  squared  second  derivative  penalty  is  used.  An 
example  in  logistic  regression  of  tumour  prevalence  is  given.  The  second 
algorithm  is  for  the  much  more  general  case  of  a  regression  model  specified  as 
an  arbitrary  composite  log-likelihood  function,  permitting  nonlinear 


dependence  and  several  splined  variables 
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SIGNIFICANCE  AND  EXPLANATION 


Generalized  linear  models  are  used  to  analyse  regression  relationships  in 
both  continuous  and  discrete  data.  They  include  Normal  theory  linear 
regression  and  analysis  of  variance ,  the  Poisson  log-linear  model,  and 
binomial  logit  and  probit  analysis.  A  basic  assumption  is  that  the 
explanatory  variables  enter  the  model  through  the  values  of  a  linear 
predictor;  the  coefficients  of  these  variables  are  usually  estimated  by 
maximum  likelihood.  However,  this  parametric  structure  is  not  always 
appropriate,  and  in  the  absence  of  any  experience,  information  or  theory 
concerning  the  form  of  a  regression  relationship,  a  non-parametric  approach 
may  be  preferable. 

This  paper  considers  the  intermediate  position  where  the  statistician  has 
some  faith  in  the  parametric  model  but  for  some  suspected  inhomogeneity  with 
respect  to  a  few  extraneous  variables  (often  representing  time  or  space).  The 
resulting  semi-parametric  model  is  analysed  by  maximum  penalized  likelihood, 
which  controls  the  roughness  of  the  otherwise  arbitrary  functional  dependence 
on  the  extraneous  variables.  Two  explicit  algorithms  are  presented.  One 
applies  to  the  restrictive  but  important  special  case  where  one  extraneous 
variable  is  handled  using  cubic  splines;  its  implementation  is  highly 
efficient,  and  is  illustrated  here  applied  to  tumour  prevalence  data.  The 
second  algorithm  applies  to  a  much  wider  class  of  regression  models  specified 
as  composite  log- likelihood  functions.  Various  derived  statistics  are 
constructed,  including  those  leading  to  automatic  choice  of  the  appropriate 
degree  of  roughness  in  the  non-parametric  part  of  the  model. 


Semi-parametric  Generalized  Linear  Models 

Peter  J.  Green  and  Brian  S.  Yandell 


1.  Introduction 

When  the  form  of  a  regression  relationship  with  respect  to  some  but  not  all  of  the 
explanatory  variables  is  unknown,  the  statistician  is  caught  in  a  quandary.  Should 
parametric  models  be  abandoned  altogether,  thus  losing  the  opportunity  of  estimating 
parameters  of  real  interest  and  sacrificing  efficiency  in  estimation  and  prediction,  or 
should  the  extraneous  variables  be  forced  into  a  parametric  model  by  imposing  a  possi¬ 
bly  inappropriate  functional  form  without  adequate  justification? 

A  compromise  is  possible  using  the  idea  of  semi-parametric  modelling.  This  has 
been  considered  by  several  authors  in  varying  degrees  of  generality;  see  for  example 
Rice  (1981),  Green,  Jennison  and  Seheult  (1983),  Wahba  (1984),  and  Green  (1985b). 
In  the  present  context  of  generalized  linear  models  we  consider  replacing  the  familiar 
linear  predictors  T),-  =  xjp  by  the  more  general  predictors 

9,-xfP  +  W  (1) 

with  x,-  a  p-vector  of  explanatory  variables  for  the  ith  observation,  (J  the 
corresponding  regression  coefficients,  tf  the  scalar  or  vector  of  extraneous  variables, 
and  y  a  function  or  curve  whose  form  is  not  specified.  As  a  simple  example,  ima¬ 
gine  a  binomial  logistic  regression  in  which  the  "intercept"  term  is  believed  to  vary  in 
time  or  with  geographical  location.  In  section  5  we  consider  in  detail  an  example 
where  a  previous  investigator  has  been  unsure  about  the  precise  dependence  of  the 
binary  response  (presence  of  tumour)  on  one  of  four  explanatory  variables  (age  at 
death). 

Straightforward  maximization  of  the  log-likelihood  function  L  ,  which  we  will 
write  in  the  composite  form  L(0(fl,Y))  to  emphasize  the  roles  of  predictors,  parame¬ 
ters,  and  unknown  curve,  is  no  longer  appropriate  as  a  method  of  estimation.  This 
leads  to  overfitting  in  the  absence  of  any  constraints  on  P  .  Indeed,  it  typically 
renders  the  parameters  fi  unidentifiable.  But  progress  is  possible  by  maximizing 
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instead  a  penalized  version  of  the  log-likelihood,  if  we  are  willing  to  place  weak  con¬ 
straints  on  the  form  of  y  by  assuming  that  it  is  smooth.  Thus  we  maximize  the  penal¬ 
ized  log-likelihood 

£(6(p,y))  -  WtT)  (2) 

where  the  penalty  functional  J  is  some  numerical  measure  of  the  "roughness"  of  y . 
This  might  be  adopted  on  ad-hoc  grounds  (for  example,  an  integrated  squared  deriva¬ 
tive  of  y ),  or  follow  from  a  Bayesian  argument  specifying  a  prior  distribution  for  y . 
The  scalar  A,  is  a  tuning  constant,  used  to  regulate  the  smoothness  of  the  fitted  curve 
y .  Typically  we  try  a  range  of  values  for  X  in  an  exploratory  fashion,  as  well  as 
considering  automatic  choice  based  on  the  data.  One  ultimate  aim  may  be  to  discover 
the  form  of  y  in  the  hope  of  modelling  it  parametrically  in  future. 

2.  Maximum  Penalized  Likelihood  Estimates 

Here  we  will  only  consider  maximization  of  (2)  over  y  in  the  span  of  a  set  of 
q  prescribed  basis  functions  {<(>*;£=  1,2 ,...,<?  }  :  we  write 

Y=  i&Ak 

fc=i 

and  assume  in  addition  that  J  satisfies 

(3) 

k=  1 

for  some  qx.q  non-negative  definite  symmetric  K  .  We  thus  re-write  the  penalized 
log-likelihood  in  the  form 

£(0(P,5))  -  '/jA4rK5  (4) 

to  be  maximized  over  choice  of  the  vectors  P  and  £  . 

This  finite-dimensional  approach  is  not  intended  to  compromise  our  non- 
parametric  assumptions  about  the  curve  y .  The  dimension  q  ,  perhaps  equal  to  n  , 
will  typically  be  too  large  for  parametric  estimation  of  £  to  be  appropriate,  and  the 
basis  functions  will  be  chosen  so  as  not  to  materially  constrain  the  curve,  except 
perhaps  in  fine  detail.  With  certain  penalty  functionals,  for  example  those  used  in 
spline  smoothing,  it  turns  out  that  with  q  =  n  we  are  not  imposing  any  constraints  at 
all  (see  section  3). 

The  semi-parametric  regression  problem  expressed  in  the  general  form  (4)  is  con¬ 
sidered  in  some  detail  by  Green  (1985b),  who  derives  the  following  iterative  scheme 


for  the  maximum  penalized  likelihood  estimates  (MPLEs)  fs  and  £  •  Suppose  we 
have  trial  estimates  P  and  £  •  Using  these,  compute  the  n-vector  of  scores  u  and 
the  nxn  information  matrix  A  : 


u  = 


3L 
30  * 


A  = 


In  the  case  of  a  generalized  linear  model,  u  and  A  may  be  expressed  as 


4 


A  =  diag 


in  the  notation  of  the  GLIM3  manual  (Baker  and  Nelder,  1978),  where  the  vectors 
p  ,  t2  ,  and  S  are  computed  with  0  replacing  the  linear  predictor.  We  also  need 
the  nxp  and  nxq  matrices  of  derivatives 


E  = 


30 


Then  updated  estimates  (p*,  4*)  are  obtained  as  the  solution  to  the  linear  system 


DtAD  DrAE 

p* 

'  * 

Dr 

ErAD  ErAE  +  XK 

V 

et 

0  • 

(5) 


where 


Y  =  A-1u  +  Dp  +  E%. 


This  scheme  is  based  on  the  Newton-Raphson  method  with  Fisher  scoring,  and  these 
updating  equations  can  be  seen  to  combine  the  iteratively  reweighted  least  squares 
equations  for  P  used  in  GLIM  (see  also  Green  (1984)),  with  ridge-regression  type 
equations  for  £  (O’Sullivan,  Yandell  and  Raynor,  Jr.,  1984;  Yandell,  1985). 

As  they  stand,  the  equations  (5)  are  not  ideal  for  practical  computation.  It  is  the 
purpose  of  this  paper  to  derive  various  algorithms  implementing  this  scheme,  and  to 
illustrate  semi-parametric  modelling  applied  to  both  real  and  simulated  data  sets. 


3.  The  One-Dimensional  Case,  with  Cubic  Spline  Smoothing 

For  a  restrictive  but  very  useful  special  case,  consider  a  generalized  linear  model 
with  predictors  {9,}  given  by  (1),  in  which  {  t:  }  are  one-dimensional,  and  suppose 
that  the  roughness  penalty  J(y)  takes  the  form  j  (y  "(t))2  dt .  This  allows  one  addi¬ 
tional  explanatory  variable  to  enter  in  a  non-parametric  fashion;  the  form  of  penalty 
used  ensures  that  the  dependence  on  this  variable  is  "visually  smooth".  Some  aspects 


of  the  purely  non-parametric  version  of  this  problem  were  discussed  by  Silverman 
(1985). 

For  simplicity,  we  suppose  that  the  {  f,  }  are  distinct  and  ordered,  ti<t2<...<tn  , 
but  relaxing  this  requirement  presents  no  great  difficulty.  It  is  well  known  (Reinsch, 
1967)  that  the  y  maximizing  (2)  for  any  fixed  P  and  X  is  a  natural  cubic  spline 
with  knots  at  {/*-},  that  the  space  of  such  splines  has  dimension  n ,  and  that  we 
may  choose  a  basis  for  this  space  with  §k(t$  =  . 

In  this  case  the  notation  used  above  simplifies:  q-n  ,  D  and  E  are  constant 
matrices  with  D  having  i  th  row  equal  to  xj  and  E  being  the  identity,  A  is 
diagonal,  and  Y  =  A-1u  +  0  .  Further,  it  is  implicit  in  Reinsch  (1967)  that  K  may 
be  written  ArW_1A  where  A  is  the  (n-2)xn  matrix  taking  second  differences: 

A  a  =  hi  ,  A  n+i  —  —  (  hi  +  h-l+\  ),  Ay+2  —  hi+\. 

and  W  is  the  symmetric  tridiagonal  matrix  of  order  (n-2)  : 

Wi-u  =  Win  =  hj/6,  Wu  =  (hi  +  hi+1)/ 3. 

where  ht  =  tM  -  r, .  The  important  point  about  this  decomposition  is  that  A  and  W 
are  banded. 

One  possible  algorithm  implementing  (5)  involves  an  inner  iteration  between  the 
pair  of  equivalent  equations 

p*  =  (DrAD)  -1DrA(Y-£*) 

K=  S  (Y-Dp*)  (6) 

where  S  =  (A  +  XK)_1A  .  This  will  always  converge  (Green,  1985a).  But  further 
iteration  can  be  avoided  by  eliminating  from  (5)  to  give 

p*  =  (DTA(I-S)D)-1DrA(I-S)Y.  (7) 

Solution  of  this  small  (pxp)  system  for  p*  is  followed  by  use  of  (6)  to  obtain  . 
From  the  updated  (P*,  ^*)  we  recompute  0  ,  thence  u  and  A  ,  and  the  cycle  is 
repeated  to  convergence. 

This  approach  is  highly  practicable,  and  very  economical,  since  apart  from  solv¬ 
ing  the  linear  equations  (7),  and  some  matrix  multiplications,  we  only  need  to  apply 
the  "smoothing  operator"  S  =  (A  +  XK)-1A  to  form  SY  and  SD  .  But  a  conse¬ 
quence  of  the  special  structure  of  K  mentioned  above  is  that  S  can  be  applied  to  a 
vector  in  only  0(n)  operations.  We  use  a  minor  modification  of  the  version  of 


Reinsch’s  algorithm  given  by  De  Boor  (1978)  to  obtain  a  very  fast  implementation. 

An  almost  identical  approach  may  be  adopted  more  generally,  with  any  penalty 
functional  for  which  S  =  (A  +  XK)_1A  may  pre-multiply  a  vector  in  0(n)  time. 
This  could  include  splines  of  different  orders,  penalties  based  on  discrete  differences, 
and  "moving  average"  smoothers. 

4.  Goodness-of-fit,  Standard  Errors,  and  Choice  of  X 

Goodness-of-fit  can  be  assessed  globally,  as  in  generalized  linear  models,  by  the 
deviance  A  =  2{supe  L(0)  -  L(0($,|))}  where  ($,£)  are  the  MPLEs.  Locally,  it  is 
measured  by  residuals:  either  the  deviance  residuals,  the  signed  square  roots  of  the 
individual  contributions  to  A  ,  or  in  GLIM  fashion  as 


This  is  all  standard,  but  we  do  need  a  new  concept  of  degrees-of-freedom  to  assign  to 
A  .  It  turns  out  (Green,  1985b)  that  the  appropriate  value,  not  in  general  an  integer,  is 
given  by 

v  =  n  -  rr(S)  -  tr  [(D^I-S^r^Aa-S)^] .  (9) 

This  is  an  approximation  to  the  asymptotic  expectation  of  A  ,  and  reduces  to  the  usual 
n  -  rank(D)  when  the  non-parametric  part  of  the  model  is  omitted.  In  the  non- 
parametric  case,  this  v  has  been  used  informally  for  linear  models  (Eubank,  1984; 
Eubank,  1985)  and  generalized  linear  models  (O’Sullivan,  Yandell  and  Raynor,  Jr., 
1984;  O’Sullivan,  1985;  Yandell,  1985). 

Similar  somewhat  approximate  asymptotics  lead  to  an  estimated  variance  matrix 
for  $  of  the  form 

(DTA(I-S)D)-1DTA(I-S)2D(DrA(I-S)D)-1  (10) 

from  which  standard  errors  may  be  calculated.  In  the  absence  of  the  appropriate  distri¬ 
bution  theory,  neither  the  deviance  nor  the  standard  errors  should  be  used  in  formal 
significance  tests,  at  present,  but  they  do  seem  to  provide  adequate  guidelines  for 
model  selection. 

Computation  of  these  quantities  follows  naturally  from  the  algorithm  outlined  in 
section  3,  and  consists  of  solving  pxp  linear  systems  following  the  repeated  applica¬ 
tion  of  S  to  D  .  The  only  part  of  this  that  is  not  simple  to  implement  in  O(n)  time 


is  the  first  trace  term,  rr(S) ,  which  in  our  present  program  takes  about  In2  multipli¬ 
cations  or  divisions.  However  an  0(n)  algorithm  for  this  computation  in  linear  spline 
smoothing  has  recently  been  announced  by  O’Sullivan  (1985),  and  we  will  adapt  this 
to  the  present  context. 

As  for  automatic  choice  of  X  ,  Wahba’s  generalized  cross-validation  (GCV) 
method  (Wahba,  1977),  which  uses  an  invariant  modification  of  a  predictive  mean- 
squared  error  criterion,  may  be  adapted  to  this  situation.  A  quadratic  approximation  to 
the  quantity  to  be  minimized  (over  X  )  is  simply  A/v2  ,  so  no  further  computation  is 
involved.  We  use  a  simple  one-dimensional  search  over  X  to  find  the  minimum. 
Other  approaches  to  the  automatic  choice  of  X  would  be  possible,  for  example  the 
empirical  Bayesian  methods  proposed  by  Leonard  (1982). 

5.  Examples 

Logistic  Regression,  and  Tumour  Prevalence  Data 

Dinse  and  Lagakos  (1983)  consider  logistic  regression  models  for  data  from  a 
U.S.  National  Toxicology  Program  bioassay  of  a  flame  retardant.  Data  on  127  male 
and  192  female  rats  exposed  to  various  doses  of  the  agent  consist  of  a  binary  response 
variable  (  y  )  indicating  presence  or  absence  of  bile  duct  hyperplasia  at  death,  and  four 
explanatory  variables:  log  dose  ( xy  ),  initial  weight  ( x2  ),  cage  position  ( x3  ),  and  age 
at  death  ( t ).  Dinse  and  Lagakos  express  some  doubts  as  to  whether  the  fourth  of 
these  variables  enters  the  model  linearly,  so  they  consider  fitting  higher-order  polyno¬ 
mials,  or  step  functions  based  on  age  intervals.  A  reasonable  alternative  seemed  to  be 
the  semi-parametric  approach  described  here,  which  allows  age  at  death  ( t )  to  enter 
the  binomial  logistic  model  in  a  non-parametric  fashion,  whilst  still  allowing  estima¬ 
tion  of  the  log  dose  regression  coefficient. 

The  results  of  our  analyses  are  presented  graphically  (see  Figure  1)  for  various 
values  of  the  tuning  constant  X  .  In  each  plot,  the  upper  two  traces  display  on  the 
same  scale  the  fitted  parametric  and  non-parametric  components  of  the  predictor, 
YPifij  7(0  =  »  plotted  against  r,- .  The  lower  panel  displays  the 

corresponding  residuals  from  (8).  (The  slightly  bizarre  appearance  of  the  residual  plot 
is  due  to  the  binary  nature  of  the  response  variable.)  In  both  data  sets  there  were  a 
considerable  number  of  ties  in  values  of  {/,•},  which  we  have  broken  arbitrarily  with 
small  deterministic  displacements.  This  seems  not  to  lead  to  any  numerical  instability 
in  our  program,  it  avoided  some  modifications  to  the  coding  to  handle  coincident 

-6- 
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For  the  data  on  the  male  rats,  Figures  1(a)  and  1(b)  demonstrate  the  effect  of 
using  very  small  and  large  values  of  X  respectively,  suggesting  under-  and  over- 
smoothing.  These  values  are  respectively  0.01  and  100  times  the  automatic  GCV 
choice  of  X  =  1380,  for  which  the  relevant  plot  is  Figure  1(c).  This  indicates  a  non¬ 
linear  dependence  on  t ,  but  note  that  the  left-hand  part  of  the  curve,  up  to  the  first 
turning  value,  is  based  on  rather  little  data.  The  parameter  estimates,  with  approximate 
standard  errors,  are  -0.13910.148,  -0.012±0.022,  and  0.06810.151,  and  we  thus  agree 
with  Dinse  and  Lagakos  about  the  lack  of  significance  of  the  regression  coefficients. 

In  the  case  of  the  female  rats,  the  GCV  method  for  choice  of  X  is  not  so  well 
behaved.  Whilst  there  is  a  turning  value  of  A/v2  at  X  =  6.24,  this  is  only  a  local 
minimum,  and  the  GCV  criterion  seems  to  decrease  to  0  for  very  small  X  .  Figure 
1(d)  displays  the  fitted  values  and  residuals  for  X  =  6.24,  suggestive  of  a  complicated 
nonlinear  dependence  on  t .  Our  parameter  estimates  are  0.49210.137,  0.04010.015 
and  0.27010.158.  Dinse  and  Lagakos  obtain  =  0.55410.20  using  a  model  linear  in 
t ,  implying  that  if  we  have  correctly  identified  the  form  of  y (r)  then  their  estimate  is 
both  slightly  biased  and  inefficient 

It  may  be  of  interest  to  report  some  details  of  the  performance  of  our  algorithm 
applied  to  these  data.  For  the  male  rats,  with  the  GCV  choice  of  X  ,  four  iterations 
were  needed  to  converge  from  initial  estimates  corresponding  to  empirical  logits  to  a 
point  where  neither  P  nor  the  deviance  changed  by  more  than  KT4  .  Excluding  the 
calculation  of  rr(S)  (see  section  4),  the  computation  time  was  1.26  seconds  on  a 
VAX  1 1/780.  For  the  female  rats,  10  iterations  were  required,  and  the  time  was  3.77 
seconds. 

Poisson  Log-linear  Regression,  with  Simulated  Data 

As  a  demonstration  that  our  approach  can  properly  identify  a  smooth  y (r)  ,  we 
analyzed  a  simulated  data  set,  with  {  yi ;  i  =  1,2, ...,200  }  distributed  independently 
with  Poisson  distributions  with  means  {  exp(9j)  }  ,  where 

5 

0,  =  YPtfij  +  1  +  sin(r,). 
y=i 

The  {  q  }  were  chosen  independently  and  uniformly  on  the  interval  (0,3ji)  ,  so  that 
the  true  curve  y (r)  =  1  +  sin  t  has  three  turning  values  in  the  range  of  the  data.  The 
design  matrix  D  was  taken  as  that  for  40  replicates  of  a  randomized  complete  blocks 
design  on  5  treatments,  each  successive  5  ordered  {  r,  }  forming  a  block.  The  true 


p  was  (  1,  0.5,  0,  -0.5,  -1),  so  that  the  Poisson  means  varied  between  about  0.4  and 

20. 

Figure  2(a)  corresponds  to  the  GCV  choice  of  X  =  3.13:  the  fitted  y(t)  has  the 
correct  form,  ana  the  parameter  estimates  are  (  1.031,  0.517,  0.020,  -0.583,  -0.984) 
with  standard  errors  (  0.058,  0.066,  0.078,  0.098,  0.116).  (Clearly  for  identifiability  a 
constraint  must  be  placed  on  P  :  we  used  =  1  )•  Figure  2(b)  demonstrates  that 
when  the  tuning  constant  X  is  set  much  too  high,  in  this  case  100  times  the  GCV 
value,  the  fitted  curve  yW  cannot  match  the  structure  in  the  data,  which  is  therefore 
forced  into  the  residuals. 

6.  Using  B-splines 

For  an  application  of  spline  smoothing  to  the  penalized  log-likelihood  (4)  more 
general  than  that  of  section  3,  suppose  that  the  {rj  remain  one-dimensional,  and  that 
the  roughness  penalty  takes  the  form 

m  =  \[im)m2dt. 

In  this  case,  an  alternative  set  of  basis  splines  is  useful,  namely  the  natural  B-splines 
of  order  2m  (De  Boor,  1978;  Schumaker,  1981).  These  are  defined  on  any  sequence 


Figure  2.  Fitted  Log  Rates  and  Residuals  for  Simulated  Poisson  Data 


of  knots  $!<  •••  <sq  as  piecewise  polynomials  of  degree  (2m- 1)  between  the  knots, 
of  degree  (m-1)  outside  (  sq  ),  and  with  (2m-2)  continuous  derivatives.  This 
basis  leads  to  stable,  economical  computing  as  B -splines  are  non-negative  and  have 
limited  support.  In  the  cubic  spline  case  (  m  =  2  ),  {<j>* ;  k=  3,  ■  •  •  ,q- 2}  are  each 
non-negative  only  on  (  sk_2,  sk+2  ),  whilst  4>i  >  4>2  ,  4*^-1  and  §q  are  linear  outside 
(  sq  ).  The  matrix  K  ,  which  has  the  form 

Kq  =  j Wvtijr'm, 

is  banded,  and  so  an  algorithm  based  on  the  equations  (7)  and  (6)  can  be  implemented 
in  0(n )  time;  see  also  Silverman  (1985). 

This  approach  is  particularly  useful  in  the  case  where  n  is  very  large  and  it  is 
desirable  that  the  number  q  of  basis  functions  be  much  smaller.  We  are  then  only 
attempting  the  restricted  minimization  of  (4),  but  with  say  q  =  50  or  100  knots 
equally  spaced  to  cover  the  range  of  {/,}  ,  this  restriction  is  not  of  practical  impor¬ 
tance. 

7.  An  Algorithm  for  the  General  Case 

The  linear  system  (5)  can  be  expressed  as  finding  P  and  £  to  minimize 

||BT(Y-Dp-E^)||2  +  (11) 

in  which  A=BBr .  Suppose  that  K  is  of  rank  r  <  q  .  Two  matrices,  J  and  T  , 
with  q  rows  and  full  column  ranks  r  and  q-  r ,  respectively,  can  be  formed  such 
that  JrKJ  =  I ,  TrKT  =  0  ,  and  JrT  =  0  (see  below).  Rewriting  £  as 

5  =  T5  +  Je,  (12) 

with  e  and  8  of  lengths  r  and  q  -  r  ,  respectively,  (11)  becomes 

||Br(Y  -  [D:ET]  [|]  -  EJe)||2  +  Xere. 

A  Householder  decomposition  (Dongarra  et  al.,  1979)  allows  one  to  separate  the 
solution  of  P  and  8  from  that  of  e  .  In  other  words,  we  decompose 

Q[Br[D:ET]  =  R,  Q[Br[D:ET]  =  0, 

in  which  Q  =  [Qi:Q2]  is  orthogonal  and  R  is  nonsingular,  upper  triangular,  and  of 
full  rank  p+q-r  .  The  linear  problem  becomes:  minimize  the  sum  of 


and 


||Q{BrY  -  r(|)  -  Q[BrEJe||2 


(13) 


||QfBTY  -  Q£btE  ’-II2  +  teTe.  (14) 

The  first  term  (13)  can  be  set  to  zero  by  appropriate  choice  of  p  and  5  given  e  . 
If  we  define  Y*  =  Q2BTY  and  Z  =  Q2BrEJ  ,  (14)  becomes  a  problem  of  minimiz¬ 
ing  •• 

||Y*  -  Ze|[2  +  \btb 


which  is  an  ordinary  ridge  regression  problem.  See  Bates  and  Wahba  (1983);  Golub, 
Heath  and  Wahba  (1979).  The  solution  is 

e*  =  (ZTZ+U)-lZTY*. 


The  other  parameters  are  solved  as 


r 


R-1Q]BT(Y-EJe*). 


One  then  computes  using  (12)  and  proceeds  with  the  nonlinear  iteration  discussed 
in  section  2. 


Decomposition  of  K 

One  need  only  compute  J  and  T  once  as  K  depends  on  the  model  only 
through  {  t,  }  .  Using  a  pivoted  Cholesky  decomposition  (Dongarra  et  al.,  1979), 

PrKP  =  i/L, 

with  L  of  dimension  rxq  and  P  a  permutation  matrix  such  that  the  first  r 
columns  of  KP  are  linearly  independent.  A  Householder  decomposition  of  LT 
yields 

F[I/  =  G,  F£l/  =  0, 

in  which  F  =  [F1:F2]  is  orthogonal  and  G  is  nonsingular,  upper  triangular,  and  of 
full  rank  r  .  It  is  known  (Dongarra  et  al.,  1979)  that  G_1F|  is  the  Moore-Penrose 
inverse  of  LT  .  Therefore,  one  can  construct  the  matrices  J  and  T  as 

T  =  PF2  and  J  =  PF,G"r. 

A  further  refinement  is  possible  if  the  partial  derivative  matrices  E  and  D  can 
be  written  as  E  =  E*M  and  D  =  D*M  ,  in  which  M  depends  only  on  {  t,-  }  .  For 


instance,  with  a  B-spline  basis  with  model  (1),  =  <►*(/,) .  One  can  left  multiply  by 

M  exactly  once,  forming  T*  =  MT  and  J*  =  MJ  each  with  n  rows,  and  replace 
E  and  D  by  E*  and  D*  in  subsequent  computations. 

Auxiliary  Statistics 

Auxiliary  statistics  can  be  constructed  in  the  general  case  in  a  similar  fashion  to 
Section  4,  with  S  in  equations  (9)  and  (10)  replaced  by 

S  =  E(EtAE  +  XK)-1ErA. 

Alternatively,  using  the  notation  of  this  section,  one  can  show  that  (9)  becomes 

v  =  n-p-q+r  -  fr|zrZ(ZrZ+XI)-1]. 

The  variance  (10)  can  be  reexpressed  as 
*  " 

var  |  =  R-1Q|[l  +  BrEJ(Z7’Z+XI)'1Z7’Z(ZrZ+Xir1JrE7’B]Q1R-7'. 

•  4 

O’Sullivan  (1985)  observed  that  the  trace  and  the  diagonal  "leverage"  elements  of  the 
hat  matrix  for  the  B-spline  basis  can  be  computed  in  0(r)  multiplications/divisions  by 
using  a  Cholesky  decomposition  of  (  Z7Z+XI  )  as  in  Silverman  (1985). 

8.  Related  Work  in  Progress 

We  have  nearly  completed  an  implementation  of  the  algorithm  for  the  general 
case,  which  will  allow  specification  of  models  through  subroutine  evaluation  of  the 
forms  of  A  ,  D  ,  E  ,  K  and  u  .  This  software  will  be  in  the  public  domain  and 
uses  UNPACK  (Dongarra  et  al.,  1979)  This  work  is  related  to  a  larger  programming 
effort  involving  Douglas  Bates,  Grace  Wahba,  and  Mary  Lindstrom. 

Yandell  and  Bates  (unpublished)  have  also  observed  that  one  can  use  the  singular 
value  decomposition  for  ridge  regression  problems  (Bates  and  Wahba,  1983;  Golub, 
Heath  and  Wahba,  1979)  to  iterate  on  the  "optimal"  choice  of  tuning  constant  X  (in 
the  sense  of  minimizing  the  generalized  cross  validation  function)  within  each  linear 
step.  Thus  X  changes  with  each  nonlinear  iteration,  as  do  P  and  £  .  Empirically, 
convergence  takes  about  the  same  number  of  steps  as  it  would  for  a  fixed  X  near  the 


optimal  value.  Unfortunately,  the  singular  value  decomposition  is  expensive,  taking 
0(r^)  multiplications/divisions.  However,  it  may  be  possible  to  combine  this 
approach  with  the  Cholesky  decomposition  approach  used  by  O’Sullivan,  Yandell  and 
Raynor,  Jr.  (1984)  and  Silverman  (1985)  to  strike  a  healthy  compromise  between 
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heavy  computation  and  finding  the  optimal  amount  of  smoothing. 
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